---------------------------
Authors: Annamaria Conti (HEC Lausanne) and
	Jorge Guzman (Columbia University)

This version: April, 2021

Contact:
     Annamaria Conti: annamaria.conti@unil.ch 
     Jorge Guzman: jag2367@gsb.columbia.edu

--------------------------



---------------------------
Overview
--------------------------

This is the data appendix to:

Conti, Annamaria, and Jorge Guzman. 2021. "What is the US Comparative Advantage on 
       Entrepreneurship? Evidence from Israeli Migration to the United States." 
       Forthcoming in The Review of Economics and Statistics. 


It contains all the scripts needed to replicate our results.  The paper uses
high-dimensional data to assess the impact of moving to the U.S. on Israeli migrants. 

All statistical analysis is done in Stata. The machine learning random forest model
is run in Python, through a port we wrote that lets Stata call the scikit-learn
random forest library.



--------------------------
File Structure
--------------------------


Main Stata .do files
--------------------

There are two main Stata .do files that run our analyses.
      - selection_results.do reports all results related to selection into
                             migration (Section 3 of our paper).

      - Main_Results.do runs all the empirical analyses reported in our paper and outputs
                        the tables as LaTeX.  These tables follow the format of the
                        working paper version of our paper.


Each file is controlled by a series of global macros that turn each analysis on or off.
All global macros are defined at the beginning of the .do files, with self-explanatory
names.


Other Stata helper files
------------------------
Our approach also uses several additional Stata files. 
    - randomforest.ado is a port of the scikit-learn random forest model that allows running a
                       random forest in Stata.  This code may be updated in the future; please
                       refer to www.jorgeguzman.co for the latest version.

   
    - selected_ml.do stores the variables selected by LASSO for each dependent variable, to
                     avoid rerunning the procedure every time: variable selection can take
                     a couple of hours per dependent variable.


    - lassoShooting.ado, authored by Christian Hansen, performs LASSO variable selection, which is
                        used as the first machine learning step on these data.  Note that because
                        it uses random sampling and cross-validation, it can pick different
                        specific variables on each run when some variables are closely correlated.


    - label_variables.do adds additional labels to the variables.
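
The caching idea behind selected_ml.do (store the LASSO-selected variables once, then
reuse them) can be sketched in Python as follows.  This is only an illustration under
stated assumptions: the package itself stores the selections in a Stata .do file, and
the function name get_selected, the JSON cache file, and the select_fn callback are all
hypothetical.

```python
import json
from pathlib import Path


def get_selected(depvar, cache_path, select_fn):
    """Return the cached variable names for depvar.

    select_fn (a hypothetical stand-in for the slow LASSO step) is only
    called when depvar is not yet in the cache file.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if depvar not in cache:
        cache[depvar] = list(select_fn(depvar))  # the expensive step
        path.write_text(json.dumps(cache, indent=2))
    return cache[depvar]
```

Deleting the cache entry for a dependent variable forces the selection to be rerun for
that variable only.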

Python code
------------------
      - randomforest/stata_randomforest_only.py is called by randomforest.ado to
                                                 execute the random forest model.
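
For readers unfamiliar with scikit-learn, the kind of call such a script wraps looks
roughly like the sketch below.  It is not the contents of stata_randomforest_only.py:
the function name fit_and_predict, the choice of a classifier rather than a regressor,
the parameter values, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_and_predict(X_train, y_train, X_new, n_trees=100, seed=0):
    """Fit a random forest and return the predicted probability of class 1."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X_train, y_train)
    # The columns of predict_proba follow the order of model.classes_.
    positive_col = list(model.classes_).index(1)
    return model.predict_proba(X_new)[:, positive_col]


# Synthetic data standing in for firm-level variables (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Train on the first 150 rows, predict for the remaining 50.
probs = fit_and_predict(X[:150], y[:150], X[150:])
```

Fixing random_state makes the fitted forest reproducible across runs, which matters
when results are to be replicated.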


Data files
-----------------

There are several data files; which one is used depends on the specification being run.
      - ml_cross_sectional.dta is the most commonly used file.  It is a cross-sectional file (one
                               observation per firm) with all machine learning variables created.

      - migration_panel.dta is the dataset in panel format used for the migration analyses that
      			    require panel data.